feat: add collections validation and GFQL support by lmeyerov · Pull Request #874 · graphistry/pygraphistry

lmeyerov · 2025-12-29T15:01:28Z

Summary

Add collections validator with strict/autofix behavior and GFQL wire normalization
Expose collections API with validate/warn and plot-time URL param validation
Add collections tests and clarify plan template location
Move collections helpers to graphistry/collections.py and introduce typed models in graphistry/models/collections.py

Key Features

g.collections(...) API for defining subsets via GFQL expressions with priority-based visual encodings
Helper constructors graphistry.collection_set(...) and graphistry.collection_intersection(...)
Support for showCollections, collectionsGlobalNodeColor, and collectionsGlobalEdgeColor URL params
Automatic JSON encoding and GFQL AST/Chain/wire-protocol normalization

Validation Behavior

Strict mode: Raises ValueError on invalid input
Autofix mode (default): Drops invalid collections with warnings
Set collections require an id field (server requirement) - missing IDs are warned/dropped, not auto-generated
Intersection collections cross-validate that referenced set IDs exist
GFQL parsing uses _wrap_gfql_expr as canonical implementation with precise exception handling
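The strict/autofix split described above can be illustrated with a small self-contained sketch. It does not call pygraphistry; the function name `validate_collections_sketch` and the plain-dict collection shape (`type`, `id`) are assumptions made for illustration only, not the PR's implementation.

```python
import warnings
from typing import Any, Dict, List


def validate_collections_sketch(
    collections: List[Dict[str, Any]],
    validate_mode: str = "autofix",
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep valid entries; raise in strict mode,
    warn and drop in autofix mode."""
    if validate_mode not in ("strict", "autofix"):
        raise ValueError(f"unknown validate_mode: {validate_mode!r}")

    kept: List[Dict[str, Any]] = []
    for idx, entry in enumerate(collections):
        problem = None
        if entry.get("type") not in ("set", "intersection"):
            problem = f"collection {idx}: unsupported type {entry.get('type')!r}"
        elif not isinstance(entry.get("id"), str):
            problem = f"collection {idx}: missing or non-string id"

        if problem is None:
            kept.append(entry)
        elif validate_mode == "strict":
            raise ValueError(problem)  # strict: fail fast on the first problem
        else:
            warnings.warn(problem)  # autofix: report and drop the entry
    return kept
```

In this sketch a collection without an `id` is dropped with a warning under the default mode and rejected with `ValueError` under `validate_mode="strict"`, mirroring the behavior listed in this description.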

Testing

python -m pytest graphistry/tests/test_collections.py graphistry/tests/test_dataset_id_invalidation.py
./bin/lint.sh
./bin/mypy.sh

Review Comments Addressed

See comment for detailed breakdown of all review comments and their resolutions.

mj3cheun · 2026-01-12T19:15:02Z

general comment, validate_mode = 'autofix' seems more likely to produce invalid output and/or result in a collection getting skipped than otherwise. to me its biggest value is typecasting stuff like id=1 to id="1" etc. not sure how much value there is in making it explicit, maybe we just have warn true/false and otherwise handle things silently?

lmeyerov · 2026-01-12T19:33:26Z

hmm, if they do something like g.plot(validate_mode='autofix'), what do we want to happen, or not happen, w/ collections?

that was mostly added b/c people keep having dirty data that fails arrow conversion and would rather see it load than fix their data/cfg

mj3cheun · 2026-01-12T19:45:35Z

@lmeyerov regarding general comment above: im thinking maybe instead of validate we call it strict with "true" or "false" where strict true will throw errors and false will pass over non-compliant collections

autofix sorta implies to me that we are "correcting" the data when invalid but we arent really i think? more just passing it over if there are any mistakes

if we feel we need to type cast we might want to just do it anyway strict true or not

lmeyerov · 2026-01-12T20:01:33Z

the issue with strict true/false is that's closer to what we had before, and users were complaining that they just wanted it to 'work', hence autofix (coerce++). validate=true/false is closer to what you're thinking, while 'autofix' (coerce) is "it'll run, but may not be what you want, but you said you wanted something that runs"

We therefore have leeway in what autofix does: we just need to warn (if warn=auto/true) and decide what it does. So the q is... what should it do wrt different collections errors, if there are strong opinions about any params in any direction?

My default intuition is probably:

drop collections with invalid gfql
colors: random-but-deterministic? or neutral grey / transparent? (default black looks buggy)
others: ??? disable / see if there's default values ?
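For the "random-but-deterministic" color option floated above, one possible approach is to derive the color from a hash of the collection ID, so the same ID always maps to the same color across runs. This is only a sketch of that idea; `fallback_color` is a hypothetical name and nothing here is taken from the PR's code.

```python
import hashlib


def fallback_color(collection_id: str) -> str:
    """Hypothetical helper: map a collection ID to a stable hex color."""
    # sha256 is stable across processes, unlike the built-in hash() for str
    digest = hashlib.sha256(collection_id.encode("utf-8")).digest()
    r, g, b = digest[0], digest[1], digest[2]
    return f"#{r:02x}{g:02x}{b:02x}"
```

A hash-derived color can still land on a near-black or low-contrast value, so a real fallback might clamp brightness or pick from a fixed palette instead.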

mj3cheun · 2026-01-12T20:08:19Z

agree with that, i think we do the 3 bullets listed if strict (or whatever we want to call it) is false and its almost what we are doing right now. theres only 1 point i would make about the current implementation

instead of just trying to typecast stuff (which a lot of the time doesnt work), i would prefer we do default values or in the case of colours, random colours

in summary its only the approach i think might need to change, not the intention

EDIT: if we do the above it actually gets closer to a true autofix than what it was before, so i dont have an issue with the name anymore

aucahuasi

Thanks for this PR, I left some comments and I think we need to solve this feedback before merge:
https://github.com/graphistry/pygraphistry/pull/874/changes#r2778233409

- Remove dead code after return in _normalize_intersection_expr
- Add empty sets list validation in _normalize_sets_list
- Remove pointless type coercion (str() won't produce valid types)
- Consolidate GFQL parsing: _normalize_gfql_ops calls _wrap_gfql_expr
- Add cross-validation for intersection set ID references
- Require id field for sets (server requires it, no auto-generation)
- Use precise exception handling (TypeError, ValueError, GFQLValidationError)
- Update tests to include required id fields

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

lmeyerov · 2026-02-08T04:06:41Z

Review Comments Addressed (commits `23d1f00`, `d2bdf02`, `47d87e8`)

HIGH Priority - Fixed

| Comment | Author | Issue | Resolution |
| --- | --- | --- | --- |
| #2778239385 | @aucahuasi | Dead code after return (line 232) | ✅ Removed unreachable code |
| #2778238588 | @aucahuasi | Validate `len(sets_list) > 0` | ✅ Added empty sets validation in `_normalize_sets_list` |
| #2778244203 | @aucahuasi | Cross-validate intersection set IDs | ✅ Added `_validate_intersection_references()` post-loop validation |
| #2778233409 | @aucahuasi | Consolidate GFQL parsing with collections.py | ✅ `_normalize_gfql_ops` now calls `_wrap_gfql_expr` from `collections.py` |
| #2778241945 | @aucahuasi | Validate `set['id']` is required | ✅ Added validation warning; drops in autofix mode (no auto-generation - user must provide meaningful IDs) |
| #2683481643 | @mj3cheun | Type string coercion pointless | ✅ Removed `str(collection_type)` - now continues/fails since coercion won't produce valid types |

MEDIUM Priority - Fixed

| Comment | Author | Issue | Resolution |
| --- | --- | --- | --- |
| #2683617536 | @mj3cheun | Use Chain.from_json for GFQL parsing | ✅ Now uses `_wrap_gfql_expr` which handles Chain/AST conversion canonically |
| #2683627736 | @mj3cheun | Dead if statement (intersection type check) | ✅ Actually NOT dead - validates `expr.type` vs `collection.type` mismatch; kept for safety |

LOW Priority - Fixed

| Comment | Author | Issue | Resolution |
| --- | --- | --- | --- |
| #2683343387 | @mj3cheun | Redundant `normalize_validation_params` call | Keep for API safety - ensures consistent param handling regardless of caller |
| #2683393920 | @mj3cheun | Inline `_coerce_collection_list` | ✅ Inlined into `_parse_collections_input` |

Still Open / Deferred

| Comment | Author | Issue | Status |
| --- | --- | --- | --- |
| #2683317737 | @mj3cheun | Auto-detect pre-encoded strings | Deferred - `encode` param was removed; always canonicalizes now |
| #2669705397 | @lmeyerov | "back out" plot-time validation | Kept - acts as safety net for bypassing `.collections()` method |

Key Design Decisions

No ID auto-generation: Server requires IDs but client should not generate arbitrary ones like set_0. Collections without IDs are dropped in autofix mode with a warning. Users must provide meaningful IDs.
Precise exception handling: Changed except Exception to except (TypeError, ValueError, GFQLValidationError) to avoid masking bugs.
Cross-validation for intersections: New _validate_intersection_references() ensures intersection sets reference actual collection IDs. Dangling references are caught before hitting the backend.
GFQL parsing consolidation: _normalize_gfql_ops now delegates to _wrap_gfql_expr from collections.py as the canonical implementation.
Proper type handling: Used dict(entry) to convert TypedDicts to plain dicts at runtime for uniform handling, avoiding unchecked static casts.
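To make the cross-validation decision concrete, here is a minimal sketch of a dangling-reference check. It assumes, purely for illustration, that an intersection is a dict whose `expr` holds a `sets` list of referenced IDs; the actual payload shape and the real `_validate_intersection_references()` may differ, and `find_dangling_references` is a hypothetical name.

```python
from typing import Any, Dict, List


def find_dangling_references(
    collections: List[Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Hypothetical helper: map each intersection ID to the referenced IDs
    that no collection in the list defines."""
    known_ids = {c["id"] for c in collections if "id" in c}
    dangling: Dict[str, List[str]] = {}
    for c in collections:
        if c.get("type") != "intersection":
            continue
        # Assumed shape: {"expr": {"type": "intersection", "sets": [...]}}
        referenced = c.get("expr", {}).get("sets", [])
        missing = [ref for ref in referenced if ref not in known_ids]
        if missing:
            dangling[c.get("id", "<no id>")] = missing
    return dangling
```

Running the check after the main loop, once every ID is known, lets an intersection appear before the sets it references without triggering a false positive.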

CI Status: All jobs green (manual workflow_dispatch run 21793451172) - python-lint-types 3.8-3.14 ✅, all test suites ✅

aucahuasi

Thanks for fixing the previous issue, here are some others that caught my attention:

https://github.com/graphistry/pygraphistry/pull/874/changes#r2789010065
https://github.com/graphistry/pygraphistry/pull/874/changes#r2789125932

lmeyerov · 2026-02-11T09:04:13Z

Thanks for the thorough reviews! All comments have been addressed:

Consolidation & Architecture:

Moved GFQL normalization to compute/ast.py:normalize_gfql_to_wire() as canonical implementation
Clean import graph: models → compute → helpers/validation
collections.py and validate_collections.py both call compute/ast

Validation Improvements:

Intersection DAG validation: supports intersections-of-intersections, detects self-references and cycles
Added AssertionError to exception handling (AST from_json methods use bare asserts)
Empty sets list validation
Cross-validation of intersection set ID references
Auto-generate kebab-case IDs in autofix mode (set-0, intersection-1)
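The DAG validation mentioned above (self-references and cycles among intersections) can be sketched with a standard depth-first search using three node states. The input format, a dict from intersection ID to the IDs it references, and the name `find_intersection_cycle` are assumptions for this example rather than the PR's actual data structures.

```python
from typing import Dict, List, Optional


def find_intersection_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Hypothetical helper: return one cycle as a path of IDs, or None.

    Keys of deps are intersection IDs; referenced IDs that are not keys
    are treated as plain sets (leaves)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {node: WHITE for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GRAY
        path.append(node)
        for ref in deps[node]:
            if ref not in deps:
                continue  # a plain set cannot be part of a cycle
            if state[ref] == GRAY:
                # ref is already on the current path: cycle found
                return path[path.index(ref):] + [ref]
            if state[ref] == WHITE:
                cycle = visit(ref)
                if cycle is not None:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state[node] == WHITE:
            cycle = visit(node)
            if cycle is not None:
                return cycle
    return None
```

A self-reference is reported as a cycle of length one, so the same check covers both cases listed above while still allowing intersections of intersections.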

Code Quality:

Removed dead code after return statements
Proper type handling with dict(entry) instead of unchecked casts
22 tests covering all validation paths

CI is green, ready for merge.

- Set collections require id field (no auto-generation)
- Intersection cross-validation for set ID references
- GFQL parsing consolidation with precise exception handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address review comment #2683393920 - removes unnecessary helper function. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

CollectionsInput includes TypedDicts which need to be cast to Dict[str, Any]. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Server requires IDs for all collection types (used as storage keys). In autofix mode, generate `{type}_{idx}` IDs when missing instead of dropping the collection. This makes simple use cases "just work" while still warning about the missing ID.

- Sets get `set_0`, `set_1`, etc.
- Intersections get `intersection_0`, `intersection_1`, etc.
- Strict mode still rejects missing IDs
- Added test for auto-generation behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Change from `set_0` to `set-0` for consistency with other ID patterns in the codebase (e.g., kepler's `dataset-{uuid[:8]}`). User-provided IDs can be any string - no validation beyond type check. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
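The kebab-case ID generation described in these two commits could look roughly like the sketch below. It assumes the index is the entry's position in the overall collections list, which the commit text does not state explicitly, and `fill_missing_ids` is a hypothetical name.

```python
import warnings
from typing import Any, Dict, List


def fill_missing_ids(collections: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical autofix helper: assign '{type}-{idx}' IDs where missing."""
    filled: List[Dict[str, Any]] = []
    for idx, entry in enumerate(collections):
        entry = dict(entry)  # copy so the caller's input is not mutated
        if "id" not in entry:
            # Assumption: idx is the position in the full list
            entry["id"] = f"{entry.get('type', 'collection')}-{idx}"
            warnings.warn(f"collection {idx} has no id; using {entry['id']!r}")
        filled.append(entry)
    return filled
```

This sketch does not guard against a generated ID colliding with a user-provided one such as `set-0`; a fuller version would need to check for that.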

- Add normalize_gfql_to_wire() as canonical GFQL→wire implementation
- Simplify collections.py (removed 36 lines of duplicate logic)
- Simplify validate_collections.py to call compute/ast
- Clean import graph: models → compute → helpers/validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…intersections)

- Allow intersections to reference other intersections (backend supports this)
- Detect self-references (intersection referencing itself)
- Detect cycles in intersection dependencies (A->B->A)
- Add tests for DAG validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

AST class-level from_json methods (ASTLet, ASTRemoteGraph, ASTRef, ASTCall) use bare `assert` for required field checks. These raise AssertionError which was not caught by the validation boundary, causing raw crashes instead of proper validation errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
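The failure mode described in this commit can be reproduced in isolation. In the sketch below, `parse_ref` stands in for a `from_json`-style parser that uses a bare `assert`, and `safe_parse_ref` is the validation boundary; both names and `ValidationErrorSketch` are hypothetical, not pygraphistry APIs.

```python
from typing import Any, Dict


class ValidationErrorSketch(ValueError):
    """Hypothetical stand-in for a structured validation error."""


def parse_ref(payload: Dict[str, Any]) -> str:
    # Stand-in for a parser that checks required fields with a bare assert
    assert "ref" in payload, "ref is required"
    return payload["ref"]


def safe_parse_ref(payload: Dict[str, Any]) -> str:
    """Validation boundary: convert parser failures into one error type."""
    try:
        return parse_ref(payload)
    except (TypeError, ValueError, AssertionError) as exc:
        # Without AssertionError in this tuple, the bare assert would escape
        raise ValidationErrorSketch(f"invalid GFQL payload: {exc}") from exc
```

Note that `assert` statements are removed when Python runs with `-O`, so a parser that relies on them may skip the check entirely in optimized mode; catching `AssertionError` only helps when asserts are active.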

Addressed in follow-up commits and subsequent discussion; all associated review threads are resolved and current head passes CI.

lmeyerov commented Jan 7, 2026

Comment thread graphistry/PlotterBase.py

mj3cheun reviewed Jan 12, 2026

Comment thread graphistry/PlotterBase.py Outdated

mj3cheun reviewed Jan 12, 2026

Comment thread graphistry/validate/validate_collections.py

mj3cheun reviewed Jan 12, 2026

Comment thread graphistry/validate/validate_collections.py Outdated

mj3cheun reviewed Jan 12, 2026

Comment thread graphistry/validate/validate_collections.py Outdated

mj3cheun reviewed Jan 12, 2026

Comment thread graphistry/validate/validate_collections.py Outdated

mj3cheun reviewed Jan 12, 2026

Comment thread graphistry/validate/validate_collections.py

lmeyerov force-pushed the feat/collections-support branch from 3cf9e9d to 41a33c4 Compare January 16, 2026 17:48

mj3cheun approved these changes Feb 6, 2026

aucahuasi reviewed Feb 7, 2026

Comment thread graphistry/collections.py Outdated

aucahuasi reviewed Feb 7, 2026

Comment thread graphistry/validate/validate_collections.py

aucahuasi reviewed Feb 7, 2026

Comment thread graphistry/validate/validate_collections.py Outdated

aucahuasi reviewed Feb 7, 2026

Comment thread graphistry/validate/validate_collections.py

aucahuasi reviewed Feb 7, 2026

Comment thread graphistry/validate/validate_collections.py

aucahuasi previously requested changes Feb 7, 2026

lmeyerov force-pushed the feat/collections-support branch 2 times, most recently from 9cc4506 to 47d87e8 Compare February 8, 2026 06:02

aucahuasi reviewed Feb 10, 2026

Comment thread graphistry/validate/validate_collections.py Outdated

aucahuasi reviewed Feb 10, 2026

Comment thread graphistry/validate/validate_collections.py Outdated

aucahuasi previously requested changes Feb 10, 2026

lmeyerov requested a review from aucahuasi February 24, 2026 20:15

chore: clarify plan location in template

0f41e94

lmeyerov and others added 25 commits April 22, 2026 20:05

feat: wrap collection set expr to gfql chain

1b0a009

Refine collections types and helpers

77a77cc

Fix collections typing for mypy

8b162d7

Validate collections settings inputs

1a6bac5

Simplify collections typing

e6f3a65

Reuse gfql chain normalization in collections helpers

2a58929

Refine collections gfql normalization for mypy

1bc6188

Normalize collections GFQL via Chain and reject Let

2cec916

Avoid Chain.from_json in collections normalization

3241710

Allow Let in collections normalization

335aeac

Simplify collections gfql wrapping

532d442

Slim collections validation helpers

83bce2f

Simplify collections input parsing

5dc3b51

fix: canonicalize collections validation and encoding

208de1f

refactor: simplify collections normalization

410302c

chore: move collections notes to development

388f25e

refactor: inline _coerce_collection_list into _parse_collections_input

23a3eb7

fix: add type casts for mypy after inlining _coerce_collection_list

6c49273

lmeyerov force-pushed the feat/collections-support branch from 2f12a79 to 3a4f494 Compare April 23, 2026 03:09

lmeyerov requested a review from mj3cheun April 23, 2026 03:19

lmeyerov merged commit 8fe2df2 into master Apr 23, 2026
121 checks passed

3 participants